8 research outputs found

    T-Crowd: Effective Crowdsourcing for Tabular Data

    Crowdsourcing employs human workers to solve computer-hard problems, such as data cleaning, entity resolution, and sentiment analysis. When crowdsourcing tabular data, e.g., the attribute values of an entity set, a worker's answers on the different attributes (e.g., the nationality and age of a celebrity) are often treated independently. This assumption does not always hold and can lead to suboptimal crowdsourcing performance. In this paper, we present the T-Crowd system, which takes the intricate relationships among tasks into consideration in order to converge faster to their true values. In particular, T-Crowd integrates each worker's answers on different attributes to effectively learn his/her trustworthiness and the true data values. The attribute relationship information is also used to guide task allocation to workers. Finally, T-Crowd seamlessly supports categorical and continuous attributes, the two main datatypes found in typical databases. Our extensive experiments on real and synthetic datasets show that T-Crowd outperforms state-of-the-art methods in truth inference and in reducing the cost of crowdsourcing.
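    A minimal sketch of the joint-estimation idea the abstract describes (not the authors' actual T-Crowd algorithm): iterate between estimating each worker's reliability from all attributes at once and re-estimating the per-attribute truths, so that a worker's answers on one attribute inform how much to trust them on another. The answer format, closeness measure, and iteration count are illustrative assumptions.

```python
# Illustrative joint truth inference over tabular answers; not the T-Crowd algorithm.
from collections import Counter, defaultdict

def infer_truths(answers, attr_types, iterations=10):
    """answers: list of (worker, item, attribute, value) tuples.
    attr_types: dict attribute -> 'categorical' or 'continuous'."""
    weights = defaultdict(lambda: 1.0)           # one reliability weight per worker
    truths = {}
    for _ in range(iterations):
        # 1) Re-estimate truths with current worker weights.
        by_key = defaultdict(list)
        for w, item, attr, val in answers:
            by_key[(item, attr)].append((w, val))
        for (item, attr), votes in by_key.items():
            if attr_types[attr] == 'categorical':
                tally = Counter()
                for w, val in votes:
                    tally[val] += weights[w]      # weighted vote
                truths[(item, attr)] = tally.most_common(1)[0][0]
            else:
                total = sum(weights[w] for w, _ in votes)
                truths[(item, attr)] = sum(weights[w] * val for w, val in votes) / total
        # 2) Re-estimate each worker's reliability from *all* attributes jointly.
        score, count = defaultdict(float), defaultdict(int)
        for w, item, attr, val in answers:
            t = truths[(item, attr)]
            if attr_types[attr] == 'categorical':
                score[w] += 1.0 if val == t else 0.0
            else:
                score[w] += 1.0 / (1.0 + abs(val - t))   # closeness on continuous attrs
            count[w] += 1
        for w in count:
            weights[w] = max(score[w] / count[w], 1e-6)
    return truths, dict(weights)

# Tiny invented example: three workers answer two attributes of one entity.
toy = [
    ('w1', 'e1', 'country', 'US'), ('w2', 'e1', 'country', 'US'), ('w3', 'e1', 'country', 'UK'),
    ('w1', 'e1', 'age', 35.0), ('w2', 'e1', 'age', 36.0), ('w3', 'e1', 'age', 60.0),
]
print(infer_truths(toy, {'country': 'categorical', 'age': 'continuous'}))
```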

    Truth Inference in Crowdsourcing: Is the Problem Solved?

    Crowdsourcing has emerged as a novel problem-solving paradigm that facilitates addressing problems that are hard for computers, e.g., entity resolution and sentiment analysis. However, due to the openness of crowdsourcing, workers may yield low-quality answers, so a redundancy-based method is widely employed: each task is first assigned to multiple workers, and the correct answer (called the truth) is then inferred from the answers of the assigned workers. A fundamental problem in this method is Truth Inference, which decides how to effectively infer the truth. Recently, the database and data mining communities have independently studied this problem and proposed various algorithms. However, these algorithms have not been compared extensively under the same framework, and it is hard for practitioners to select appropriate ones. To alleviate this problem, we provide a detailed survey of 17 existing algorithms and perform a comprehensive evaluation using 5 real datasets. We make all code and datasets public for future research. Through experiments we find that existing algorithms are not stable across different datasets and that no algorithm consistently outperforms the others. We believe that the truth inference problem is not fully solved; we identify the limitations of existing algorithms and point out promising research directions.
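    To make the redundancy-based setup concrete, the sketch below shows the simplest truth inference baseline, majority voting: each task receives answers from several workers and the most common answer is taken as the truth. The toy data is invented purely for illustration; the survey's evaluation uses five real datasets and far more sophisticated algorithms.

```python
# Majority voting: the simplest redundancy-based truth inference baseline.
from collections import Counter

def majority_vote(task_answers):
    """task_answers: dict task_id -> list of worker answers for that task."""
    return {task: Counter(ans).most_common(1)[0][0]
            for task, ans in task_answers.items()}

answers = {
    't1': ['cat', 'cat', 'dog'],   # three redundant answers per task
    't2': ['dog', 'dog', 'dog'],
}
print(majority_vote(answers))      # {'t1': 'cat', 't2': 'dog'}
```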

    Managing the quality of crowdsourced databases

    Many important data management and analytics tasks cannot be completely addressed by automated processes. For example, entity resolution, sentiment analysis, and image recognition can be enhanced through the use of human input. Crowdsourcing platforms are an effective way to harness the capabilities of the crowd to apply human computation to such tasks. In recent years, crowdsourced data management has become an area of increasing interest in research and industry. Typical crowd workers vary widely in expertise, background, and quality. As such, a crowdsourced database, which collects information from these workers, may be highly noisy and inaccurate. Thus it is of utmost importance to manage the quality of crowdsourced databases. In this thesis, we identify and address two fundamental problems in crowdsourced quality management: (1) Task Assignment, which selects suitable tasks and assigns them to appropriate crowd workers; (2) Truth Inference, which aggregates answers obtained from crowd workers to infer the final result. For the task assignment problem, we consider two common settings adopted in existing crowdsourcing solutions: task-based and worker-based. In the task-based setting, given a pool of n tasks, we are interested in which k of them should be assigned to a worker. A poor assignment may not only waste time and money, but may also hurt the quality of a crowdsourcing application that depends on the workers' answers. We propose to consider evaluation metrics (e.g., Accuracy and F-score) that are relevant to an application, and we explore how to optimally assign tasks in an online manner. In the worker-based setting, given a monetary budget and a set of workers, we study how workers should be selected such that the tasks at hand can be accomplished successfully and economically. We observe that this is related to the aggregation of workers' qualities, and propose a solution that optimally aggregates the qualities of different workers, which is fundamental to selecting workers. For the truth inference problem, although extensive solutions exist, we find that they have not been compared extensively under the same framework, and it is hard for practitioners to select appropriate ones. We conduct a detailed survey of 17 existing solutions and provide an in-depth analysis from various perspectives. Finally, we integrate task assignment and truth inference in a unified framework, and apply them to two crowdsourcing applications, namely image tagging and question answering. For image tagging, a worker is asked to select the correct label(s) among multiple given choices; we identify workers' unique characteristics in answering such multi-label tasks and study how they can help solve the two problems. For question answering, workers may have diverse qualities across different domains; for example, a worker who is a basketball fan should have better quality on a task of labeling a photo related to 'Stephen Curry' than on one related to 'Leonardo DiCaprio'. We leverage domain knowledge to accurately model a worker's quality, and apply the model to addressing the two problems.
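    A minimal sketch of the online, task-based assignment setting described above: when a worker arrives, pick the k of n open tasks whose collected answers are least settled (smallest margin between the top two answers), so the new answer is spent where it can improve an accuracy-style metric most. This greedy uncertainty heuristic only illustrates the setting; it is not the thesis's optimal assignment algorithm, and the helper names are assumptions.

```python
# Greedy, uncertainty-driven online task assignment (illustrative heuristic only).
from collections import Counter

def pick_tasks(current_answers, k):
    """current_answers: dict task_id -> list of answers collected so far.
    Returns the k task ids with the smallest margin between the top two answers."""
    def margin(ans):
        if not ans:
            return 0                       # unanswered tasks are maximally uncertain
        counts = [c for _, c in Counter(ans).most_common(2)]
        return counts[0] - (counts[1] if len(counts) > 1 else 0)
    return sorted(current_answers, key=lambda t: margin(current_answers[t]))[:k]

open_tasks = {'t1': ['A', 'A', 'B'], 't2': ['A'], 't3': [], 't4': ['B', 'B', 'B']}
print(pick_tasks(open_tasks, k=2))         # -> ['t3', 't1']
```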

    KB-Enabled Query Recommendation for Long-Tail Queries

    In recent years, query recommendation algorithms have been designed to provide related queries to search engine users. Most of these solutions, which perform extensive analysis of users' search history (or query logs), are largely insufficient for long-tail queries that rarely appear in query logs. To handle such queries, we study a new solution that makes use of a knowledge base (or KB), such as YAGO and Freebase. A KB is a rich information source that describes how real-world entities are connected. We extract entities from a query and use these entities to explore new ones in the KB. The discovered entities are then used to suggest new queries to the user. As shown in our experiments, our approach provides better recommendation results for long-tail queries than existing solutions.
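    A minimal sketch of the KB-based recommendation flow described above: detect entities in the query, look up related entities in a knowledge-base-like adjacency map, and turn the discovered entities into suggested queries. The tiny in-memory "KB", the verbatim entity matching, and the substitution strategy are illustrative assumptions; the paper works with large KBs such as YAGO and Freebase and a more careful entity-linking step.

```python
# Toy KB-driven query recommendation; a stand-in for the approach, not the paper's system.
KB = {  # entity -> related entities (a tiny stand-in for a real knowledge base)
    'stephen curry': ['golden state warriors', 'klay thompson', 'nba'],
    'nba': ['stephen curry', 'lebron james'],
}

def extract_entities(query, kb):
    """Naive entity linking: return KB entities that appear verbatim in the query."""
    q = query.lower()
    return [e for e in kb if e in q]

def recommend(query, kb, limit=5):
    suggestions = []
    for entity in extract_entities(query, kb):
        for related in kb.get(entity, []):
            # Build a new query by swapping the matched entity for a related one.
            candidate = query.lower().replace(entity, related)
            if candidate != query.lower():
                suggestions.append(candidate)
    return suggestions[:limit]

print(recommend('stephen curry highlights', KB))
# ['golden state warriors highlights', 'klay thompson highlights', 'nba highlights']
```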

    Crowdsourced data management: hybrid machine-human computing


    Entity-Based Query Recommendation for Long-Tail Queries

    Query recommendation, which suggests related queries to search engine users, has attracted a lot of attention in recent years. Most existing solutions, which perform analysis of users' search history (or query logs), are often insufficient for long-tail queries that rarely appear in query logs. To handle such queries, we study the use of entities found in queries to provide recommendations. Specifically, we extract entities from a query and use these entities to explore new ones by consulting an information source. The discovered entities are then used to suggest new queries to the user. In this article, we examine two information sources: (1) a knowledge base (or KB), such as YAGO and Freebase; and (2) a click log, which contains the URLs that users accessed after issuing queries. We study how to use these sources to find new entities useful for query recommendation. We further study a hybrid framework that effectively integrates different query recommendation methods. As shown in the experiments, our proposed approaches provide better recommendations than existing solutions for long-tail queries. In addition, our query recommendation process takes less than 100 ms to complete, so our solution is suitable for providing online query recommendation services for search engines.
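    A minimal sketch of the click-log source and the hybrid combination mentioned above: entities whose queries led to clicks on the same URLs are treated as related, and candidates from the click log and from a KB are merged with a simple weighted score. The toy log, the scoring weights, and the helper names are illustrative assumptions, not the article's actual model.

```python
# Illustrative click-log entity discovery plus a simple hybrid merge of two sources.
from collections import defaultdict

def related_from_clicks(click_log):
    """click_log: list of (entity, clicked_url). Entities sharing URLs are related."""
    by_url = defaultdict(set)
    for entity, url in click_log:
        by_url[url].add(entity)
    related = defaultdict(set)
    for entities in by_url.values():
        for e in entities:
            related[e] |= entities - {e}
    return related

def hybrid_candidates(entity, kb_related, click_related, kb_weight=0.6):
    """Merge the two sources; entities found by both sources score highest."""
    scores = defaultdict(float)
    for e in kb_related.get(entity, []):
        scores[e] += kb_weight
    for e in click_related.get(entity, []):
        scores[e] += 1.0 - kb_weight
    return sorted(scores, key=scores.get, reverse=True)

log = [('stephen curry', 'u1'), ('golden state warriors', 'u1'),
       ('nba finals', 'u2'), ('stephen curry', 'u2')]
kb = {'stephen curry': ['klay thompson', 'golden state warriors']}
print(hybrid_candidates('stephen curry', kb, related_from_clicks(log)))
# ['golden state warriors', 'klay thompson', 'nba finals']
```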